Background: Decided to participate in a 36 hour ‘over the weekend’ data science hackathon. No aspirations of placing, just learning. What I’m presenting here is my interpretation of the question, and approach to the problem. (Unfortunately, there was some ambiguity as to the actual structure of the problem*)

Problem Statement

Basically a two-parter. Part A You, a data scientist at an insurance company, are trying to predict the probability of renewal (standard ‘churn’ problem, logistic regression). The interesting part was Part B the optimization: you can spend extra to have the agent give extra attention to clients, and thereby increase their probability of renewal. BUT this relationship was not linear.

pacman::p_load(tidyverse, caret, plotROC, pROC, DMwR2, plotly)
fun.3 <- function(x) .2*(1-exp(-2*(1-exp(-x/400))))
p <- ggplot(data=data.frame(x=0), mapping = aes(x=x))
q <- p + stat_function(fun = fun.3)+xlim(0,1500)+ylim(0,.175)+
  xlab('Incentives Spent to Increase Probability')+
  ylab('Increase in Probability')
ggplotly(q)

The goal is to maximize Total Net Revenue, defined by: the updated Probability, P(baseline + incentive bump) times the Premium on the policy minus whatever you spent on Incentives (summed across all rows).

My approach: Work out the relationships between variables using function approximation. If/when I come across a more elegant/correct way to do the optimization, I’ll update this page or write a part 2.

Considerations

  • Probability ranges from 0-1. If a customer is predicted to have a high probability of renewing their policy (say ~0.99), you cannot incentivize their policy to combined probability > 1. In other words, the maximum renvenue per policy, before subtracting incentives, is the Premium
  • You’ll want to maximize those policies that already have high baseline P(renewal). Because it doesn’t cost very much (curve is steepest closer to 0). That seems silly and probably not something you’d do in real life, BUT the problem did not account for compensation of the agent, so your only goal is to maximize return according to the formula given.
  • The smaller the premium, the smaller the ‘upper limit’ will be on incentives, since the incentive cost-per-%-improvement will provide diminishing returns as the curve flattens out. The optimium ROI will depend on the premium amount (I’ll cover in a minute)

Scoring Metric

I’ll ignore for now, since the dataset isn’t public. Your submission included P(baseline) and incentive per policy in the test set. Scoring was a weighted average that favored the AUC (of the ROC) for your Part A solution (70% of score), and the other ~30% of score was the total net revenue.

EDA + Model

A simple logisitc regression on untransformed variables performed fairly well in this case. Since optimizing a binary classifier wasn’t what I was interested in learning, and is not the focus of this post, I’ll skim over and show you some quick code for getting Part A out of the way. The DMwR2 package in R makes kNN imputation quite easy. There were a handful in both the training and test sets that needed to be addressed.

train <- read.csv('train.csv')
trainCC <- knnImputation(train[,-c(1)])
lr0 <- glm(renewal ~ ., data = trainCC, family = 'binomial')
lr0pred <- predict(lr0, trainCC, type = 'resp')
trainCC$lr0pred <- lr0pred
ggplot(trainCC, aes(d=renewal, m = lr0pred))+geom_roc()

auc <- roc(trainCC$renewal, lr0pred); auc # on training
## 
## Call:
## roc.default(response = trainCC$renewal, predictor = lr0pred)
## 
## Data: lr0pred in 4998 controls (trainCC$renewal 0) < 74855 cases (trainCC$renewal 1).
## Area under the curve: 0.8317

So, that’s enough for now, I’ll make a companion post exploring how to optimize the binary classifier model. But since the best AUC I saw was ~0.85, I call that darn good for a first-pass. FWIW, feature engineering did little to enhance the model, according to several of the top-ranked participants, and my own fiddling (quite frustrating!).

Optimization Strategy

The first step was to figure out the optimum ‘ROI’ for Incentives, across the range of policy premiums. There are two ways to do this, the smart way is to plot this equation, where z = ROI, x = Incentives, and y = Premium. It gives you a 3d surface - find the equation that defines the maximum for each x,y and you’re all set.

fun.3d <- function(x,y) (.2*(1-exp(-2*(1-exp(-x/400))))*y) - x

Except I don’t have the math chops to do that. It sure is pretty though…

incentives <- seq(0,1000,10)
premium <- seq(1200,60000, length.out=101)
roi <- as.matrix(read.csv('surface.csv', header = F))

plot_ly(x=~incentives, 
        y=~premium,
        z=~roi) %>% add_surface() 

So here’s what I did instead: Turned to my trusty friend Excel, and plotted a poor man’s 3d surface. We already know the relationship between Incentives and Increase in Probability, so we can multiply the latter by each premium amount for a sensible range of Incentive values, and deduct the cost to calculate the ROI (see below).

The highlighted table in yellow is a list of values we can now use curve fitting in R, to get a good approximation of that relationship. We can use the output (a function) to tell us the optimal Incentive for each value of Premium (in the range of our dataset anyways).

ideal <- read.csv('ideal.csv', header=T)
colnames(ideal) <- c('premium', 'incentive')
tail(ideal)
##    premium incentive
## 10   20000       580
## 11   25000       650
## 12   30000       700
## 13   40000       790
## 14   50000       860
## 15   60000       920
range(ideal$incentive)
## [1]  20 920
ggplot(ideal, aes(x=premium, y=incentive))+geom_point()

Now, this part will not make sense to most people - but since my background is Pharmacology, I thought “Gee that looks an awful lot like a sigmoidal dose response curve”, so I download the drc package in R to fit the data. I guess you’ll just have to trust me that it makes sense (and I’m sure there are other packages and formulas that would approximate the fit just fine).

library(drc); library(modelr)
drc <- drm(incentive ~ premium, data = ideal, fct = LL.4(), type = 'continuous')
rsquare(drc, ideal)
## [1] 0.9997965
plot(drc, log = '', cex = 2)

Looks close enough for me. Now we can call predict() to get the optimal Incentive as a function of Premium!

Room for Improvement

Now.. What to do about people who are already very likely to renew their policy. This formula works to find the maximum amount to spend, as a function of the premium, to increase the Probability of Renewal for people with a low P(baseline) - ie, unlikely to renew. But, they are a relatively small portion of the dataset (~6%). Performing well here will involve some kind of optimization of policies that are already near P(baseline)=1. One thing I noticed from the Excel sheet was that spending less than the optimal amount never resulted in a LOSS (negative ROI). In other words, it is always worth spending a little to bump up your P(baseline) to 1, even if it was already very high.
So, the first thing to do on the test set is to calculate how much ‘room’ for improvement there is before you hit P(renewal) = 1. That is simply 1 - P(baseline) – which again is just the output of our logistic regression model.

test <- read.csv('test.csv') #read in
testCC <-  knnImputation(test[,-c(1)]) # don't use id for knn
testCC <- cbind(id = test$id, testCC) # add it back
testPred <- predict(lr0, testCC, type = 'response')  # get P(baseline) for test set
testCC$pred <- testPred  # add back to dataframe
testCC$room <- 1 - testCC$pred # calculate room for improvement

Solve for X (?!)

OK, next is where we run into a problem: We have a new maximum for how much we should incentivize, but in units of Y (based on the relationship we already have for those variables, see fun.3 above). AND, the equation is ridiculously complicated, to the point that Wolfram Alpha can’t even solve for X. We could go back to school for a math degree, OR… use another function approximation from observable data points.

y = seq(0,1000, 10) # range of incentives, new Y
deltaP <- fun.3(y) # Increase Prob for range of incentives, new X
solveX <- data.frame(x=deltaP, y=y)
solveX <- solveX[2:101,] # get rid of origin, zeros
tail(solveX)
##             x    y
## 96  0.1673989  950
## 97  0.1675483  960
## 98  0.1676933  970
## 99  0.1678342  980
## 100 0.1679709  990
## 101 0.1681038 1000
plot((solveX$x), log(solveX$y))

Wild. Ok, looks like a 3rd order polynomial might be able to approximate.

model <- lm(log(y) ~ x + I(x^2) + I(x^3) -1, data=solveX) # -1 = no intercept term, force thru origin
summary(model)
## 
## Call:
## lm(formula = log(y) ~ x + I(x^2) + I(x^3) - 1, data = solveX)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26135 -0.03762 -0.01638  0.06507  1.11193 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## x        134.896      2.901   46.50   <2e-16 ***
## I(x^2) -1213.932     45.699  -26.56   <2e-16 ***
## I(x^3)  3904.845    176.162   22.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1812 on 97 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 3.666e+04 on 3 and 97 DF,  p-value: < 2.2e-16

Well, I like the R-squared. Works for me. Now we have a way to go from ‘room’ variable (room for improvement before you hit probability of 1) to Incentive.

model$coefficients
##          x     I(x^2)     I(x^3) 
##   134.8965 -1213.9316  3904.8455
fun.model <- function(x) exp( (model$coefficients[1]*x )+
                                (model$coefficients[2]*(x^2))+
                                (model$coefficients[3]*(x^3)) )

We’ll use this to calculate what the incentive should be when the ‘room’ for improvement is less than what the ideal incentive would suggest.

Here is the code to finish these calculations on the test set to get our predictions. If you followed along so far, this should be no problem.

testX <- testCC[,c(1,12:14)] # dplyr select knitr fail
testX <- testX %>% 
  mutate(ideal_incentive = predict(drc, test[c('id','premium')])) %>% # optimum based on premium
  mutate(room_incentive = fun.model(pmin(room, 0.1669))) %>%  # max spend dictated by headroom
  mutate(incentives = trunc(pmin(ideal_incentive, room_incentive))) %>% # take the lower value
  mutate(new_prob = fun.3(incentives) + pred) %>% # verify
  filter(new_prob >= 1) %>% # uh oh?
  # select(id, premium, pred, room) %>% # failing in knitr
  arrange(desc(new_prob))

head(testX, 10)
##        id premium      pred       room ideal_incentive room_incentive
## 1   73714   26400 0.9243101 0.07568995        665.6823       141.0298
## 2   97179   13800 0.9250901 0.07490988        488.7091       139.0090
## 3   68806    7500 0.9239201 0.07607991        343.1634       142.0297
## 4  103903    7500 0.9235314 0.07646865        343.1634       143.0195
## 5   17160   22200 0.9242994 0.07570062        616.3146       141.0572
## 6    2374    5400 0.9235211 0.07647888        273.4742       143.0455
## 7    6956    5400 0.9239052 0.07609477        273.4742       142.0676
## 8   42637   18000 0.9254659 0.07453412        558.5628       138.0259
## 9   45410    7500 0.9254658 0.07453423        343.1634       138.0262
## 10  32156    5700 0.9239029 0.07609708        284.5246       142.0735
##    incentives new_prob
## 1         141 1.013903
## 2         139 1.013902
## 3         142 1.013900
## 4         143 1.013896
## 5         141 1.013892
## 6         143 1.013885
## 7         142 1.013885
## 8         138 1.013884
## 9         138 1.013884
## 10        142 1.013883
nrow(testX) / nrow(testCC)
## [1] 0.18937
summary(testX$incentives)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    86.0   126.0   128.3   167.0   692.0

Imperfection

So, the caveat is that because my function approximations weren’t perfect, I’ve got a few predicted predicted probabilities (after incentives) that are a hair over 1.0. There is no mechanism in place to detect or penalize for that (aside from lost revenue), so you could leave them as-is, or scale down the incentives by a certain percentage, when the premium is within a certain range.

Performance

From me I got distracted as the deadline approached, and never submitted for a final score. I had made submissions that were on the top ~1/3 of the leaderboard before figuring out the optimization part, but “the platform” for this particular competition is a dud for people coming to learn:

  • doesn’t automatically submit your best score (why??)
  • neither the problem decription nor solution check are up after the competition closes (I thought perhaps because it was a problem used for hiring/interviews, but its the same for all past competitions)
  • THE TYPO- I refered to ambiguity earlier, there was actually a typo in the description of a variable, and the formulas given described increase in probability on a 0-100 scale.
  • No clarification of ambiguity offered during the hackathon despite a Slack channel dedicated to it.

A bit anti-climactic I suppose. Lesson learned: Stick to Kaggle if you’re learning.

I’d love to know the ‘right’ way to go about this problem, so if you have an opinion or a link, please share!